{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "text_classification_with_hub.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "private_outputs": true,
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.2"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Ic4_occAAiAT"
      },
      "source": [
        "##### Copyright 2019 The TensorFlow Authors."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "ioaprt5q5US7",
        "colab": {}
      },
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ItXfxkxvosLH"
      },
      "source": [
        "# 使用 Keras 和 Tensorflow Hub 对电影评论进行文本分类"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hKY4XMc9o8iB"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://tensorflow.google.cn/tutorials/keras/text_classification_with_hub\"><img src=\"https://tensorflow.google.cn/images/tf_logo_32px.png\" />在 tensorFlow.google.cn 上查看</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs/blob/master/site/zh-cn/tutorials/keras/text_classification_with_hub.ipynb\"><img src=\"https://tensorflow.google.cn/images/colab_logo_32px.png\" />在 Google Colab 中运行</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/tensorflow/docs/blob/master/site/zh-cn/tutorials/keras/text_classification_with_hub.ipynb\"><img src=\"https://tensorflow.google.cn/images/GitHub-Mark-32px.png\" />在 GitHub 上查看源代码</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a href=\"https://storage.googleapis.com/tensorflow_docs/docs/site/zh-cn/tutorials/keras/text_classification_with_hub.ipynb\"><img src=\"https://tensorflow.google.cn/images/download_logo_32px.png\" />下载 notebook</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eDesG2pcNY-E",
        "colab_type": "text"
      },
      "source": [
        "Note: 我们的 TensorFlow 社区翻译了这些文档。因为社区翻译是尽力而为， 所以无法保证它们是最准确的，并且反映了最新的\n",
        "[官方英文文档](https://www.tensorflow.org/?hl=en)。如果您有改进此翻译的建议， 请提交 pull request 到\n",
        "[tensorflow/docs](https://github.com/tensorflow/docs) GitHub 仓库。要志愿地撰写或者审核译文，请加入\n",
        "[docs-zh-cn@tensorflow.org Google Group](https://groups.google.com/a/tensorflow.org/forum/#!forum/docs-zh-cn)。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Eg62Pmz3o83v"
      },
      "source": [
        "\n",
        "此笔记本（notebook）使用评论文本将影评分为*积极（positive）*或*消极（nagetive）*两类。这是一个*二元（binary）*或者二分类问题，一种重要且应用广泛的机器学习问题。\n",
        "\n",
        "本教程演示了使用 Tensorflow Hub 和 Keras 进行迁移学习的基本应用。\n",
        "\n",
        "我们将使用来源于[网络电影数据库（Internet Movie Database）](https://www.imdb.com/)的 [IMDB 数据集（IMDB dataset）](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb)，其包含 50,000 条影评文本。从该数据集切割出的 25,000 条评论用作训练，另外 25,000 条用作测试。训练集与测试集是*平衡的（balanced）*，意味着它们包含相等数量的积极和消极评论。\n",
        "\n",
        "此笔记本（notebook）使用了 [tf.keras](https://www.tensorflow.org/guide/keras)，它是一个 Tensorflow 中用于构建和训练模型的高级API，此外还使用了 [TensorFlow Hub](https://www.tensorflow.org/hub)，一个用于迁移学习的库和平台。有关使用 `tf.keras` 进行文本分类的更高级教程，请参阅 [MLCC文本分类指南（MLCC Text Classification Guide）](https://developers.google.com/machine-learning/guides/text-classification/)。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "2ew7HTbPpCJH",
        "colab": {}
      },
      "source": [
        "from __future__ import absolute_import, division, print_function, unicode_literals\n",
        "\n",
        "import numpy as np\n",
        "\n",
        "try:\n",
        "  # Colab only\n",
        "  %tensorflow_version 2.x\n",
        "except Exception:\n",
        "    pass\n",
        "import tensorflow as tf\n",
        "\n",
        "import tensorflow_hub as hub\n",
        "import tensorflow_datasets as tfds\n",
        "\n",
        "print(\"Version: \", tf.__version__)\n",
        "print(\"Eager mode: \", tf.executing_eagerly())\n",
        "print(\"Hub version: \", hub.__version__)\n",
        "print(\"GPU is\", \"available\" if tf.config.experimental.list_physical_devices(\"GPU\") else \"NOT AVAILABLE\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "iAsKG535pHep"
      },
      "source": [
        "## 下载 IMDB 数据集\n",
        "IMDB数据集可以在 [Tensorflow 数据集](https://github.com/tensorflow/datasets)处获取。以下代码将 IMDB 数据集下载至您的机器（或 colab 运行时环境）中："
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "zXXx5Oc3pOmN",
        "colab": {}
      },
      "source": [
        "# 将训练集按照 6:4 的比例进行切割，从而最终我们将得到 15,000\n",
        "# 个训练样本, 10,000 个验证样本以及 25,000 个测试样本\n",
        "train_validation_split = tfds.Split.TRAIN.subsplit([6, 4])\n",
        "\n",
        "(train_data, validation_data), test_data = tfds.load(\n",
        "    name=\"imdb_reviews\", \n",
        "    split=(train_validation_split, tfds.Split.TEST),\n",
        "    as_supervised=True)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "l50X3GfjpU4r"
      },
      "source": [
        "## 探索数据\n",
        "\n",
        "让我们花一点时间来了解数据的格式。每一个样本都是一个表示电影评论和相应标签的句子。该句子不以任何方式进行预处理。标签是一个值为 0 或 1 的整数，其中 0 代表消极评论，1 代表积极评论。\n",
        "\n",
        "我们来打印下前十个样本。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "QtTS4kpEpjbi",
        "colab": {}
      },
      "source": [
        "train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))\n",
        "train_examples_batch"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "IFtaCHTdc-GY"
      },
      "source": [
        "我们再打印下前十个标签。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "tvAjVXOWc6Mj",
        "colab": {}
      },
      "source": [
        "train_labels_batch"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "LLC02j2g-llC"
      },
      "source": [
        "## 构建模型\n",
        "\n",
        "神经网络由堆叠的层来构建，这需要从三个主要方面来进行体系结构决策：\n",
        "\n",
        "* 如何表示文本？\n",
        "* 模型里有多少层？\n",
        "* 每个层里有多少*隐层单元（hidden units）*？\n",
        "\n",
        "本示例中，输入数据由句子组成。预测的标签为 0 或 1。\n",
        "\n",
        "表示文本的一种方式是将句子转换为嵌入向量（embeddings vectors）。我们可以使用一个预先训练好的文本嵌入（text embedding）作为首层，这将具有三个优点：\n",
        "\n",
        "* 我们不必担心文本预处理\n",
        "* 我们可以从迁移学习中受益\n",
        "* 嵌入具有固定长度，更易于处理\n",
        "\n",
        "针对此示例我们将使用 [TensorFlow Hub](https://www.tensorflow.org/hub) 中名为 [google/tf2-preview/gnews-swivel-20dim/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1) 的一种**预训练文本嵌入（text embedding）模型** 。\n",
        "\n",
        "为了达到本教程的目的还有其他三种预训练模型可供测试：\n",
        "\n",
        "* [google/tf2-preview/gnews-swivel-20dim-with-oov/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim-with-oov/1) ——类似 [google/tf2-preview/gnews-swivel-20dim/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1)，但 2.5%的词汇转换为未登录词桶（OOV buckets）。如果任务的词汇与模型的词汇没有完全重叠，这将会有所帮助。\n",
        "* [google/tf2-preview/nnlm-en-dim50/1](https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1) ——一个拥有约 1M 词汇量且维度为 50 的更大的模型。\n",
        "* [google/tf2-preview/nnlm-en-dim128/1](https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1) ——拥有约 1M 词汇量且维度为128的更大的模型。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "In2nDpTLkgKa"
      },
      "source": [
        "让我们首先创建一个使用 Tensorflow Hub 模型嵌入（embed）语句的Keras层，并在几个输入样本中进行尝试。请注意无论输入文本的长度如何，嵌入（embeddings）输出的形状都是：`(num_examples, embedding_dimension)`。\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "_NUbzVeYkgcO",
        "colab": {}
      },
      "source": [
        "embedding = \"https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1\"\n",
        "hub_layer = hub.KerasLayer(embedding, input_shape=[], \n",
        "                           dtype=tf.string, trainable=True)\n",
        "hub_layer(train_examples_batch[:3])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "dfSbV6igl1EH"
      },
      "source": [
        "现在让我们构建完整模型："
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "xpKOoWgu-llD",
        "colab": {}
      },
      "source": [
        "model = tf.keras.Sequential()\n",
        "model.add(hub_layer)\n",
        "model.add(tf.keras.layers.Dense(16, activation='relu'))\n",
        "model.add(tf.keras.layers.Dense(1, activation='sigmoid'))\n",
        "\n",
        "model.summary()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "6PbKQ6mucuKL"
      },
      "source": [
        "层按顺序堆叠以构建分类器：\n",
        "\n",
        "1. 第一层是 Tensorflow Hub 层。这一层使用一个预训练的保存好的模型来将句子映射为嵌入向量（embedding vector）。我们所使用的预训练文本嵌入（embedding）模型([google/tf2-preview/gnews-swivel-20dim/1](https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1))将句子切割为符号，嵌入（embed）每个符号然后进行合并。最终得到的维度是：`(num_examples, embedding_dimension)`。\n",
        "2. 该定长输出向量通过一个有 16 个隐层单元的全连接层（`Dense`）进行管道传输。\n",
        "3. 最后一层与单个输出结点紧密相连。使用 `Sigmoid` 激活函数，其函数值为介于 0 与 1 之间的浮点数，表示概率或置信水平。\n",
        "\n",
        "让我们编译模型。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "L4EqVWg4-llM"
      },
      "source": [
        "### 损失函数与优化器\n",
        "\n",
        "一个模型需要损失函数和优化器来进行训练。由于这是一个二分类问题且模型输出概率值（一个使用 sigmoid 激活函数的单一单元层），我们将使用 `binary_crossentropy` 损失函数。\n",
        "\n",
        "这不是损失函数的唯一选择，例如，您可以选择 `mean_squared_error` 。但是，一般来说 `binary_crossentropy` 更适合处理概率——它能够度量概率分布之间的“距离”，或者在我们的示例中，指的是度量 ground-truth 分布与预测值之间的“距离”。\n",
        "\n",
        "稍后，当我们研究回归问题（例如，预测房价）时，我们将介绍如何使用另一种叫做均方误差的损失函数。\n",
        "\n",
        "现在，配置模型来使用优化器和损失函数："
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "Mr0GP-cQ-llN",
        "colab": {}
      },
      "source": [
        "model.compile(optimizer='adam',\n",
        "              loss='binary_crossentropy',\n",
        "              metrics=['accuracy'])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "35jv_fzP-llU"
      },
      "source": [
        "## 训练模型\n",
        "\n",
        "以 512 个样本的 mini-batch 大小迭代 20 个 epoch 来训练模型。 这是指对 `x_train` 和 `y_train` 张量中所有样本的的 20 次迭代。在训练过程中，监测来自验证集的 10,000 个样本上的损失值（loss）和准确率（accuracy）："
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "tXSGrjWZ-llW",
        "colab": {}
      },
      "source": [
        "history = model.fit(train_data.shuffle(10000).batch(512),\n",
        "                    epochs=20,\n",
        "                    validation_data=validation_data.batch(512),\n",
        "                    verbose=1)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "9EEGuDVuzb5r"
      },
      "source": [
        "## 评估模型\n",
        "\n",
        "我们来看下模型的表现如何。将返回两个值。损失值（loss）（一个表示误差的数字，值越低越好）与准确率（accuracy）。"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "zOMKywn4zReN",
        "colab": {}
      },
      "source": [
        "results = model.evaluate(test_data.batch(512), verbose=2)\n",
        "for name, value in zip(model.metrics_names, results):\n",
        "  print(\"%s: %.3f\" % (name, value))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "z1iEXVTR0Z2t"
      },
      "source": [
        "这种十分朴素的方法得到了约 87% 的准确率（accuracy）。若采用更好的方法，模型的准确率应当接近 95%。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "5KggXVeL-llZ"
      },
      "source": [
        "## 进一步阅读\n",
        "\n",
        "有关使用字符串输入的更一般方法，以及对训练期间准确率（accuracy）和损失值（loss）更详细的分析，请参阅[此处](https://www.tensorflow.org/tutorials/keras/basic_text_classification)。"
      ]
    }
  ]
}
